-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/1492 extend timestamp config #1669
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! implementation idea and code structure is good. duckdb was a hard case because we had technical debt where we silently created columns without timezones if precision required it. now we can do it right
my take: try to implement postgres and snowflake next. if you need any credentials ping us but I think you have access to CI vault
thx for the tests
@rudolfix I've implemented your review comments, and added more tests. I'll review & fix the CICD failing tests. |
Added postgresql dlt's implementation and tests. Links to my notebooks:
Working on Snowflake. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really good! I think we are really close to close it.
- could we support BigQuery and Athena? those are popular destinations
- for other destinations log a warning that we do not support ts without timezones
then we are good to go. I like your tests
@rudolfix - Comment updated. warning log added. BigQuery - I attempted to implement the timezone flag for BigQuery using both I have documented my findings here: Athena - Athena only supports a single File uploaded to
I used the following DDL statement to create the table: CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS(
id INT,
t TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 's3://test-athena-julian/test'; |
58a2150
to
3a4613c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I have suggestion to move docs to different sections. the code is good. thanks for investigating and cleaning up timestamp mess :)
|
||
`duckdb` can create unique indexes for columns with `unique` hints. However, **this feature is disabled by default** as it can significantly slow down data loading. | ||
|
||
## Supported column flags |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those are still called hints. I'd move this section under
Data loading
and replace this header with
Data types
(it is used in several other destinations)
please apply this to other docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Question: What about this section?
## Supported column hints
`duckdb` can create unique indexes for columns with `unique` hints. However, **this feature is disabled by default** as it can significantly slow down data loading.
It was there before this PR. Should we keep it as it is, or move it under Data Loading as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
* feat: add timezone flag to configure timestamp data * fix: delete timezone init * test: add duckdb timestamps with timezone * test: fix resource hints for timestamp * test: correct duckdb timestamps * test: timezone tests for parquet files * exp: add notebook with timestamp exploration * test: refactor timestamp tests * test: simplified tests and extended experiments * exp: timestamp exp for duckdb and parquet * fix: add pyarrow reflection for timezone flag * fix lint errors * fix: CI/CD move tests pyarrow module * fix: pyarrow timezone defaults true * refactor: typemapper signatures * fix: duckdb timestamp config * docs: updated duckdb.md timestamps * fix: revert duckdb timestamp defaults * fix: restore duckdb timestamp default * fix: duckdb timestamp mapper * fix: delete notebook * docs: added timestamp and timezone section * refactor: duckdb precision exception message * feat: postgres timestamp timezone config * fix: postgres timestamp precision * fix: postgres timezone false case * feat: add snowflake timezone and precision flag * test: postgres invalid timestamp precision * test: unified timestamp invalid precision * test: unified column flag timezone * chore: add warn log for unsupported timezone or precision flag * docs: timezone and precision flags for timestamps * fix: none case error * docs: add duckdb default precision * fix: typing errors * rebase: formatted files from upstream devel * fix: warning message and reference TODO * test: delete duplicated input_data array * docs: moved timestamp config to data types section * fix: lint and format * fix: lint local errors
* feat: add timezone flag to configure timestamp data * fix: delete timezone init * test: add duckdb timestamps with timezone * test: fix resource hints for timestamp * test: correct duckdb timestamps * test: timezone tests for parquet files * exp: add notebook with timestamp exploration * test: refactor timestamp tests * test: simplified tests and extended experiments * exp: timestamp exp for duckdb and parquet * fix: add pyarrow reflection for timezone flag * fix lint errors * fix: CI/CD move tests pyarrow module * fix: pyarrow timezone defaults true * refactor: typemapper signatures * fix: duckdb timestamp config * docs: updated duckdb.md timestamps * fix: revert duckdb timestamp defaults * fix: restore duckdb timestamp default * fix: duckdb timestamp mapper * fix: delete notebook * docs: added timestamp and timezone section * refactor: duckdb precision exception message * feat: postgres timestamp timezone config * fix: postgres timestamp precision * fix: postgres timezone false case * feat: add snowflake timezone and precision flag * test: postgres invalid timestamp precision * test: unified timestamp invalid precision * test: unified column flag timezone * chore: add warn log for unsupported timezone or precision flag * docs: timezone and precision flags for timestamps * fix: none case error * docs: add duckdb default precision * fix: typing errors * rebase: formatted files from upstream devel * fix: warning message and reference TODO * test: delete duplicated input_data array * docs: moved timestamp config to data types section * fix: lint and format * fix: lint local errors
Description
dlt
does not support timestamps without timezones and to be able to pass such timestamps form source to destination we need to extend the core library.The scope of this PR is to implement a proof of concept (POC) for DuckDB and parquet. This includes updating the type mapper and adding tests where the timezone is passed via columns.
Additionally, a complementary notebook with various experiments is included in this PR.
https://github.com/donotpush/dlt-notebooks/blob/main/notebooks/duckdb-exploration.ipynb
@rudolfix requirements to solve #1492 :
TColumnType
that will say if data_type contains timezone or not (or define time zone as string, with default to "UTC"). Seeprecision
andscale
hints already there. The idea is the same.pyarrow.py
contains functions that convert dlt schema into arrow schema and vice versa. those functions needs to be upgradedRelated Issues